My data set consists of 4,898 white wines with 11 input variables.
Data fields
Input variables (based on physicochemical tests): 1 - fixed acidity, 2 - volatile acidity, 3 - citric acid, 4 - residual sugar, 5 - chlorides, 6 - free sulfur dioxide, 7 - total sulfur dioxide, 8 - density, 9 - pH, 10 - sulphates, 11 - alcohol
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
Other:
13 - id (unique ID for each sample, needed for submission)
head(pf)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
Summary
summary(pf)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
PLOT EACH VARIABLE
ggplot(aes(x=fixed.acidity),data=pf)+
geom_histogram(binwidth = 1)+
scale_x_continuous(limits = c(3,15),breaks = seq(3,15,1))
summary(pf$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The minimum fixed acidity is 3.8 and the maximum is 14.2. Most wines lie close to the median of 6.8 (the interquartile range is 6.3 to 7.3).
ggplot(aes(x=volatile.acidity),data=pf)+
geom_histogram(binwidth = 0.2) +
scale_x_continuous(limits = c(0,1.2),breaks = seq(0,1.2,0.2))
summary(pf$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The minimum volatile acidity is 0.08 and the maximum is 1.10. Most wines lie close to the median of 0.26 (the interquartile range is 0.21 to 0.32).
ggplot(aes(x=citric.acid),data=pf)+
geom_histogram(binwidth = 0.1) +
scale_x_continuous(limits = c(0,1.7),breaks = seq(0,1.7,0.1))
summary(pf$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The minimum citric acid is 0.00 and the maximum is 1.66. Most wines lie close to the median of 0.32 (the interquartile range is 0.27 to 0.39).
ggplot(aes(x=residual.sugar),data=pf)+
geom_histogram(binwidth = 5)+
scale_x_continuous(limits = c(0,67),breaks = seq(0,67,5))
summary(pf$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The minimum residual sugar is 0.6 and the maximum is 65.8. The median is about 5.2, and the mean (6.39) lies above it, so the distribution is right-skewed. Three quarters of the wines have residual sugar of 9.9 or less.
ggplot(aes(x=chlorides),data=pf)+
geom_histogram(binwidth = 0.05) +
scale_x_continuous(limits =c(0.0,0.35),breaks = seq(0.0,0.35,0.05))
summary(pf$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The minimum chloride level is 0.009 and the maximum is 0.346. Half of the wines lie between 0.036 and 0.050, so most wines have chlorides close to the median of 0.043.
ggplot(aes(x=free.sulfur.dioxide),data=pf)+
geom_histogram(binwidth = 5)+
scale_x_continuous(limits = c(0,300),breaks = seq(0,300,30))
summary(pf$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The minimum free sulfur dioxide is 2.0 and the maximum is 289.0. The number of wines rises up to about the median value of 34.0 and then falls, so most wines have free sulfur dioxide close to the median.
ggplot(aes(x=total.sulfur.dioxide),data=pf)+
geom_histogram(binwidth = 10)+
scale_x_continuous(limits = c(9,450),breaks = seq(9,450,40))
summary(pf$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The minimum total sulfur dioxide is 9.0 and the maximum is 440.0. The number of wines rises up to about the median value of 134.0 and then falls, so most wines have total sulfur dioxide close to the median.
ggplot(aes(x=density),data=pf)+
geom_histogram(binwidth = 0.0002)+
scale_x_continuous(limits = c(0.975,1.039),breaks = seq(0.975,1.039,0.01))
summary(pf$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density is similar across most of the wines: the median is 0.9937 and the interquartile range is 0.9917 to 0.9961.
ggplot(aes(x=pH),data=pf)+
geom_histogram(binwidth = 0.06)+
scale_x_continuous(limits = c(2.70,3.820),breaks = seq(2.70,3.820,0.06))
## Warning: Removed 2 rows containing missing values (geom_bar).
summary(pf$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The minimum pH is 2.72 and the maximum is 3.82. The number of wines rises up to about the median value of 3.18 and then falls, so most wines have a pH close to the median.
ggplot(aes(x=sulphates),data=pf)+
geom_histogram(binwidth = 0.06)+
scale_x_continuous(limits = c(0.2200,1.0800),breaks = seq(0.2200,1.080,0.06))
summary(pf$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The minimum sulphates value is 0.22 and the maximum is 1.08. The number of wines rises up to about the median value of 0.47 and then falls, so most wines have sulphates close to the median.
ggplot(aes(x=alcohol),data=pf)+
geom_histogram(binwidth = 0.5)+
scale_x_continuous(limits = c(8.00,14.20),breaks = seq(8.00,14.20,0.5))
summary(pf$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The minimum alcohol is 8.0 and the maximum is 14.2. The number of wines rises up to about the median value of 10.40 and then falls, so most wines have alcohol close to the median.
ggplot(aes(x=quality),data=pf)+
geom_bar()+
labs(title="Wine quality Distribution(barchart)")+
scale_x_continuous(limits = c(3.00,9.00),breaks = seq(3.00,9.00,1))
summary(pf$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The minimum quality is 3 and the maximum is 9. Most wines have a quality of about 6 (the median is 6 and the mean is 5.878).
pf$alcohol_percentage<-cut(pf$alcohol,c(8,10,12,14,16))
head(pf$alcohol_percentage)
## [1] (8,10] (8,10] (10,12] (8,10] (8,10] (10,12]
## Levels: (8,10] (10,12] (12,14] (14,16]
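One detail of cut() is worth checking here. By default its intervals are open on the left, so a wine whose alcohol equals the lowest break (8.0, the minimum in the summary) gets NA instead of a group. Below is a minimal sketch of that behaviour. The five alcohol values are picked only for illustration, and include.lowest = TRUE is one possible way to keep the minimum, not something the analysis above does.

```r
# Illustrative alcohol values; 8.0 is the minimum reported by summary(pf$alcohol).
alcohol <- c(8.0, 8.8, 10.0, 10.1, 14.2)
breaks  <- c(8, 10, 12, 14, 16)

# Default: intervals are (low, high], so a value equal to the lowest break is NA.
default_groups <- cut(alcohol, breaks)

# include.lowest = TRUE closes the first interval on the left: [8,10].
closed_groups <- cut(alcohol, breaks, include.lowest = TRUE)

print(default_groups)
print(closed_groups)
```

This is why rows with NA in the grouping column have to be filtered out before the box plots are drawn.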
This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
(worst) -> (best) Quality: 3, 4, 5, 6, 7, 8, 9. It is stored as a number but takes only integer values.
Other observations:
The average quality of wine is 5.878. For several ingredients, the plots suggest that quality rises as the amount increases up to about its median value and falls as the amount increases beyond the median.
## What is/are the main feature(s) of interest in your dataset?
The main features in the data set are alcohol and quality. I'd like to determine which ingredients are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model of wine quality.
## What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol likely contribute to the quality of a white wine. After researching wine quality, I think alcohol contributes most.
I created a variable for the alcohol percentage group of each wine from the alcohol variable. This arose in the bivariate section of my analysis when I explored how the quality of a wine varied with its alcohol percentage. The grouping was calculated by dividing the alcohol percentage into four groups.
I examined the alcohol distribution and its correlation with quality, since alcohol is the variable most clearly related to wine quality. I also examined the pH distribution and its correlation with quality, which is weaker.
head(pf)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality alcohol_percentage
## 1 6 (8,10]
## 2 6 (8,10]
## 3 6 (10,12]
## 4 6 (8,10]
## 5 6 (8,10]
## 6 6 (10,12]
The physicochemical variables of a white wine tend to correlate with each other, and some of them also correlate with quality. Of the pairs tested below, alcohol and quality show the strongest correlation.
## Warning in ggcorr(subset(pf, select = -c(X)), method = c("all.obs",
## "spearman"), : data in column(s) 'alcohol_percentage' are not numeric and
## were ignored
cor.test(pf$alcohol,pf$quality)
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
cor.test(pf$alcohol,pf$pH)
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09374446 0.14893205
## sample estimates:
## cor
## 0.1214321
cor.test(pf$pH,pf$quality)
##
## Pearson's product-moment correlation
##
## data: pf$pH and pf$quality
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07162022 0.12707983
## sample estimates:
## cor
## 0.09942725
ggplot(aes(x=alcohol,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
summary(pf$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
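The minimum alcohol is 8.0 and the maximum is 14.2. The bars are tallest near the median alcohol of 10.40 and shrink on either side. Because stat="identity" stacks the quality scores of all wines at each alcohol value, bar height reflects how many wines have that alcohol level as well as their scores, so this plot alone does not show that quality falls above the median.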
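A bar drawn with stat="identity" adds up the quality scores at each x value, so its height mixes the number of wines with their scores. Here is a minimal sketch of separating the two. The toy data frame is made up for illustration and is not taken from the wine file.

```r
# Hypothetical toy data, NOT rows from the wine data set.
toy <- data.frame(
  alcohol = c(9.0, 9.5, 9.5, 10.4, 10.4, 10.4, 12.5),
  quality = c(5,   5,   6,   6,    6,    5,    7)
)
toy$alcohol_group <- cut(toy$alcohol, c(8, 10, 12, 14, 16))

# What a stacked identity bar shows: the SUM of scores per group (driven by count).
summed <- tapply(toy$quality, toy$alcohol_group, sum)

# What is usually of interest: the MEAN score per group.
averaged <- tapply(toy$quality, toy$alcohol_group, mean)

print(summed)
print(averaged)
```

In this toy example the sums are 16, 17 and 7 for the first three groups, which peaks in the middle, while the means rise steadily from group to group. The same check on pf would show whether the peak near the median comes from the scores or only from the counts.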
ggplot(aes(x=sulphates,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
summary(pf$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The minimum sulphates value is 0.22 and the maximum is 1.08. The summed quality rises up to about the median of 0.47 and falls beyond it, which mirrors where most wines are concentrated.
ggplot(aes(x=pH,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)
summary(pf$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The minimum pH is 2.72 and the maximum is 3.82. The summed quality rises up to about the median of 3.18 and falls beyond it, which mirrors where most wines are concentrated.
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),inverse = function(x) x^3)
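The transformation only works if the transform and the inverse undo each other. A quick base R check of the two functions passed to trans_new is sketched below; the helper names cube_root, cube and signed_cube_root are mine and exist only for this illustration.

```r
# The same two functions that cuberoot_trans passes to trans_new().
cube_root <- function(x) x^(1/3)
cube      <- function(x) x^3

# Round trip on non-negative values (all wine variables are non-negative).
x <- c(0, 1, 8, 27, 100)
round_trip <- cube(cube_root(x))
print(all.equal(round_trip, x))

# Caveat: in R a negative base with a fractional exponent gives NaN.
negative_case <- cube_root(-8)

# A sign-preserving variant, in case negative values ever need transforming.
signed_cube_root <- function(x) sign(x) * abs(x)^(1/3)
print(signed_cube_root(-8))
```

Since every variable in this data set is zero or positive, the plain version is enough for the plots that follow.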
From a subset of the data, fixed acidity and total sulfur dioxide do not seem to have strong correlations with quality, while alcohol is moderately correlated with quality (r = 0.44) and pH only weakly (r = 0.10). I want to look closer at scatter plots involving quality and some other variables such as fixed acidity, alcohol and pH.
ggplot(aes(fixed.acidity,quality),data=pf)+
geom_jitter()+
scale_x_continuous(trans = cuberoot_trans(),limits = c(6,14),
breaks = c(6,8,10,12,14))+
scale_y_continuous(trans = log10_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality(log10) by cube-root of fixed acidity')
ggplot(aes(free.sulfur.dioxide,quality),data=pf)+
geom_jitter()+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = log10_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality(log10) by cube-root of free sulphur dioxide')
As the free sulfur dioxide quantity increases, the variance in quality increases. Quality appears to rise up to about the median value of free sulfur dioxide and to fall after that, although the correlation test below shows no significant linear relationship.
cor.test(pf$free.sulfur.dioxide,pf$quality)
##
## Pearson's product-moment correlation
##
## data: pf$free.sulfur.dioxide and pf$quality
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01985292 0.03615626
## sample estimates:
## cor
## 0.008158067
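The confidence interval printed by cor.test can be reproduced by hand with the Fisher z-transformation, which makes it easier to see why a correlation this small is not significant. A short sketch using only the numbers from the output above:

```r
r <- 0.008158067   # sample correlation reported by cor.test
n <- 4896 + 2      # number of wines: degrees of freedom + 2

z  <- atanh(r)            # Fisher z-transform of r
se <- 1 / sqrt(n - 3)     # standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)   # back-transform the 95% limits

print(ci)
```

The interval contains 0, which agrees with the p-value of 0.5681: there is no evidence of a linear relationship between free sulfur dioxide and quality.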
newDataA<-pf[!is.na(pf$alcohol_percentage),]
ggplot(aes(x=alcohol_percentage,y=quality),data=newDataA)+
geom_boxplot()
summary(pf$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The median alcohol is 10.40. I expected wines in the group around this value to have a higher quality than the other groups, so the box plot is surprising. There are many outliers. The variation in quality tends to increase with alcohol percentage up to the median value and to decrease above it.
summary(pf$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pf$pH_group<-cut(pf$pH,c(2.720,3.02,3.32,3.62,3.82))
names(pf)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "alcohol_percentage" "pH_group"
newData<-pf[!is.na(pf$pH_group),]
head(pf$pH_group)
## [1] (2.72,3.02] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32]
## Levels: (2.72,3.02] (3.02,3.32] (3.32,3.62] (3.62,3.82]
ggplot(aes(x=pH_group,y=quality),data = newData)+
geom_boxplot()
summary(pf$pH_group)
## (2.72,3.02] (3.02,3.32] (3.32,3.62] (3.62,3.82] NA's
## 627 3428 803 39 1
The median pH is 3.18. I expected wines in the group around this value to have a higher quality than the other groups, so the box plot is surprising. There are many outliers. The variation in quality tends to increase with pH up to the median value and to decrease above it.
# Bivariate Analysis
## Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Quality correlates moderately with alcohol percentage (r = 0.44) and weakly with pH (r = 0.10).
As alcohol percentage increases, the variance in quality increases up to the median value. In the plot of quality vs alcohol, quality appears to increase up to the median value of alcohol and to decrease after that. The relationship between alcohol and quality is not regular.
The correlation between alcohol and quality is about 0.44, so the R^2 value is about 0.19: alcohol explains about 19 percent of the variance in quality. Other ingredients of interest can be incorporated into the model to explain more of the variance in quality.
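For a linear model with a single predictor, R^2 is the square of the Pearson correlation. The sketch below applies that to the correlation reported by cor.test above and confirms the identity on simulated data. The simulated x and y are assumptions made only for this check.

```r
# Correlation between alcohol and quality, copied from the cor.test output.
r_alcohol_quality <- 0.4355747
r_squared <- r_alcohol_quality^2
print(r_squared)

# Check the identity R^2 = r^2 on simulated data (not the wine data).
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
fit <- lm(y ~ x)
identity_holds <- isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2))
print(identity_holds)
```

So a correlation of about 0.44 corresponds to roughly 19 percent of the variance explained, not 44 percent.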
The alcohol percentage and quality tend to correlate with each other. Alcohol and pH are weakly positively correlated (r = 0.12): wines with a higher alcohol percentage tend to have a slightly higher pH.
The quality of a wine is positively correlated with alcohol and, more weakly, with pH. Free sulfur dioxide shows almost no linear correlation with quality (r = 0.008, p = 0.57), and fixed.acidity was only inspected visually. Because alcohol and pH are only weakly correlated with each other, they do not measure the same property, so both can be used together in a model to predict the quality of wine.
ggplot(data = pf,aes(x=density,y=alcohol,color=factor(quality)))+
coord_cartesian(xlim = c(0.985,1.002),
ylim = c(7.5,15))+
geom_jitter(size=1)+
geom_smooth(method = 'lm')+
scale_x_continuous(breaks = seq(0.985,1.002,0.002))+
scale_color_brewer(type = 'seq',guide=guide_legend(title='Quality Levels'))+
labs(x='Density (g/cm^3)',y='Alcohol(% by volume)',
title='Relationship of density and alcohol with colored quality levels')
The graph shows that wines with alcohol of about 11 to 12 and density of about 0.989 to 0.997 have the highest quality. Holding alcohol constant, quality falls as density increases; similarly, holding density constant, quality falls as alcohol volume increases.
ggplot(data = pf,aes(x=free.sulfur.dioxide,y=alcohol,color=factor(quality)))+
coord_cartesian(xlim = c(2.00,289),
ylim = c(7.5,15))+
geom_jitter(size=1)+
geom_smooth(method = 'lm')+
scale_x_continuous(breaks = seq(2.00,289,20))+
scale_color_brewer(type = 'seq',guide=guide_legend(title='Quality Levels'))+
labs(x='Free sulphur dioxide(mg/dm^3)',y='Alcohol(% by volume)',title='Relationship of free sulphur dioxide and alcohol with colored quality levels')
The graph shows that wines with alcohol of about 11 to 12 and free sulfur dioxide of about 22 to 62 have the highest quality. Holding alcohol constant, quality falls as free sulfur dioxide increases; similarly, holding free sulfur dioxide constant, quality falls as alcohol volume increases. Quality level 9 is an exception: when alcohol volume and free sulfur dioxide increase together in a certain ratio, the highest quality is maintained.
ggplot(data = pf,aes(x=pH,y=alcohol,color=factor(quality)))+
coord_cartesian(xlim = c(2.720,3.820),
ylim = c(7.5,15))+
geom_jitter(size=1)+
geom_smooth(method = 'lm')+
scale_x_continuous(breaks = seq(2.720,3.820,0.1))+
scale_color_brewer(type = 'seq',guide=guide_legend(title='Quality Levels'))+
labs(x='pH',y='Alcohol(% by volume)',title='Relationship of pH and alcohol with colored quality levels')
Wines with an average amount of alcohol and an average pH have the highest quality. For a given alcohol percentage, quality decreases whether the pH rises above or falls below its average value.
library(RColorBrewer)
ggplot(aes(free.sulfur.dioxide,quality,color=alcohol_percentage),data=pf)+
geom_jitter(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='Alcohol Percentage',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = log10_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality(log10) by cube-root of free sulphur dioxide and alcohol')
The plot indicates that a linear model could be constructed to predict wine quality, using log10(quality) as the outcome variable and the cube root of free sulfur dioxide as the predictor variable. From the graphs above, wine quality appears to increase up to the median alcohol percentage and to decrease as the alcohol percentage rises above its median.
ggplot(aes(free.sulfur.dioxide,quality,color=pH_group),data=pf)+
geom_jitter(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='pH group',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,100),
breaks = c(0,20,40,60,100))+
scale_y_continuous(trans = log10_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality(log10) by cube-root of free sulphur dioxide and pH group')
ggplot(aes(volatile.acidity,quality,color=pH_group),data=pf)+
geom_jitter(alpha=0.5,size=1,position = 'jitter')+
scale_color_brewer(type='div',
guide=guide_legend(title='pH group',reverse = T,
override.aes = list(alpha=1,size=2)))+
scale_x_continuous(trans = cuberoot_trans(),limits = c(0,2),
breaks = c(0.5,1,1.5,2))+
scale_y_continuous(trans = log10_trans(),limits = c(2,10),
breaks = c(2,4,6,8,10))+
ggtitle('Quality(log10) by cube-root of volatile acidity and pH group')
pf$pH_group<-cut(pf$pH,c(2.720,3.02,3.32,3.62,3.82))
names(pf)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "alcohol_percentage" "pH_group"
newData<-pf[!is.na(pf$pH_group),]
head(pf$pH_group)
## [1] (2.72,3.02] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32] (3.02,3.32]
## Levels: (2.72,3.02] (3.02,3.32] (3.32,3.62] (3.62,3.82]
ggplot(aes(x=pH_group,y=quality),data = newData)+
geom_boxplot()
Wine quality appears to increase with alcohol up to the median alcohol value. As alcohol rises beyond its median value, the quality of wine decreases.
newDataA=pf[!is.na(pf$alcohol_percentage),]
ggplot(aes(x=alcohol_percentage,y=quality),data = newDataA)+
geom_boxplot()
Ideally, wines also fall in the average alcohol and pH groups. The variance of wine quality increases up to the median alcohol percentage and decreases after that.
The last two plots from the Multivariate section suggest that I can build a linear model and use those variables in the model to predict the quality of wine. The results of the model are summarized below.
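As a rough sketch of what such a linear model could look like, the block below fits quality on alcohol and pH. It runs on simulated data, because the wine file is not loaded here: the data frame sim, its coefficients and its noise level are all assumptions made for illustration and are not results from the wine data.

```r
set.seed(42)
n <- 500

# Simulated predictors spanning the ranges reported by summary(pf).
sim <- data.frame(
  alcohol = runif(n, 8, 14.2),
  pH      = runif(n, 2.72, 3.82)
)

# Assumed generating process, for illustration only.
sim$quality <- 2 + 0.3 * sim$alcohol + 0.2 * sim$pH + rnorm(n, sd = 0.8)

# Linear model with two predictors, the same formula one could try on pf.
fit <- lm(quality ~ alcohol + pH, data = sim)

print(summary(fit)$coefficients)
print(summary(fit)$r.squared)
```

On the real data the same call would use data = pf, and the coefficient table and R^2 would then describe how much of the variation in quality alcohol and pH explain together.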
Wine quality both rises and falls with alcohol: it increases as the alcohol percentage rises to its median value and decreases as the alcohol percentage rises beyond the median.
ggplot(aes(x=quality),data=pf)+
geom_bar()+
labs(title="Wine quality Distribution(barchart)")+
scale_x_continuous(limits = c(3.00,9.00),breaks = seq(3.00,9.00,1))
## Description One
The minimum quality is 3 and the maximum is 9. Most wines have a quality of about 6.
# Plot two
ggplot(aes(x=alcohol,y=quality),data=pf)+
geom_histogram(stat="identity",binwidth = 1)+
labs(x='Alcohol(% by Volume)',y='Quality')
The minimum alcohol is 8.0 and the maximum is 14.2. The summed quality rises up to about the median alcohol of 10.40 and falls beyond it. Because the bars stack the quality scores of all wines at each alcohol value, their height also reflects how many wines have that alcohol level.
# Plot Three
ggplot(data = pf,aes(x=density,y=alcohol,color=factor(quality)))+
coord_cartesian(xlim = c(0.985,1.002),
ylim = c(7.5,15))+
geom_jitter(size=1)+
geom_smooth(method = 'lm')+
scale_x_continuous(breaks = seq(0.985,1.002,0.002))+
scale_color_brewer(type = 'seq',guide=guide_legend(title='Quality Levels'))+
labs(x='Density (g/cm^3)',y='Alcohol(% by volume)',
title='Relationship of density and alcohol with colored quality levels')
The graph shows that wines with alcohol of about 11 to 12 and density of about 0.989 to 0.997 have the highest quality. Holding alcohol constant, quality falls as density increases; similarly, holding density constant, quality falls as alcohol volume increases. Quality also appears to increase as free sulfur dioxide decreases.
The white wine data set contains information on 4,898 white wines with 11 variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wine across many variables and considered a linear model to predict the quality of wine.
This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine, each with a quality rating between 0 (very bad) and 10 (very excellent). The main features in the data set are alcohol and quality. I'd like to determine which ingredients are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model of wine quality.
(worst) -> (best) Quality: 3, 4, 5, 6, 7, 8, 9. It is stored as a number but takes only integer values.
There was a clear trend between the quality of a wine and its alcohol percentage, and weaker trends with pH and the other variables. I realized that most of the wines have an alcohol level close to the average. For the linear model, all wines were included since information on alcohol, pH, acidity, chlorides, and sulphates was available for all the wines. After transforming quality to a log scale and taking the cube root of the other variables, most variables show the same pattern: wine quality increases as the quantity rises to its median value and decreases as the quantity rises beyond the median.
Some limitations of this model stem from the source of the data. The data set covers only 4,898 wines, which is not very large, and it is a sample rather than the whole population, so the predictions may be wrong. To investigate further, I would like to gather much more data and train a model on it. I would like to analyze which factors best describe wine quality and to see which combinations of ingredients customers like most.